Abstract. Link prediction is an open problem in complex networks that currently attracts much
research interest. However, little attention has been paid to the relation between network structure
and the performance of prediction methods. To fill this gap, we study how network structure affects
the performance of link prediction methods from the viewpoint of clustering. Our experiments on
both synthetic and real-world networks show that as clustering grows, the precision of these methods
improves remarkably, while on sparse and weakly clustered networks they perform poorly.
We explain this by the separation that increased clustering produces between the score distributions
of positive and negative instances. Our finding also sheds light on how to select appropriate
approaches for networks with different densities and clusterings.
1 Introduction

Many real-world data sets can be represented as networks, with nodes denoting objects and edges
describing the relationships between them [5]. Examples of complex networks include the Internet,
a collection of connected Autonomous Systems (AS), routers and interfaces at different levels.
Online social networks, in which people maintain their friendships, are another major instance.
Because such networks are pervasive, the last decade has witnessed the study of complex networks
in both computer science and physics. An important issue in the computational analysis of complex
networks is link prediction, a problem of both theoretical and practical significance. It aims to
evaluate the likelihood of a link between two nodes that are not yet connected, based on the
existing links and, possibly, node attributes in the network [12]. There are two aspects to the link
prediction problem: on the one hand, for most real network data not all links have been observed,
and link prediction helps to find the missing links; on the other hand, it can help us infer new
interactions between nodes in the near future. Research on link prediction is also helpful for other
tasks, such as collective classification [4] and anomalous link discovery [17].

The existing methods for link prediction can be divided into three categories. The first category
defines a measure of proximity or similarity between two nodes in the network, on the assumption
that links between more similar nodes are more likely to exist. Liben-Nowell and Kleinberg [10]
summarize many similarity measures based on node neighborhoods, the ensemble of all paths and
higher-level approaches. They compare these measures with random predictors on five co-authorship
networks and find that there is indeed useful information contained in the network topology alone.
Motivated by the resource allocation process taking place in networks, Zhou et al. [21] propose a
new similarity measure, which performs well on six representative networks drawn from different
fields. Liu and Lü [11] put forward a method based on local random walk, which gives excellent
predictions at low computational complexity. The second category is based on maximum likelihood
estimation. Empirical studies suggest that many real-world networks exhibit hierarchical
organization. Clauset, Moore and Newman [5] present a method that infers hierarchical structure
from network data and uses this knowledge to predict missing links in partially known networks.
The third category mainly uses machine learning techniques. Hasan et al. [7] view link prediction
as a supervised learning task: for two potentially connected nodes, predict whether the pair is a
positive or a negative example. The feature set extracted from the co-authorship graph contains
proximity features, aggregated features and topological features. They experiment with seven
classification algorithms and compare the classifiers using different performance metrics.
O'Madadhain et al. [16] primarily use probabilistic classifiers to predict future "co-participating"
in event-based network data. There are also many works on link prediction in more complicated
networks, such as directed and weighted networks. Leung et al. [3] propose a novel Link Formation
Rules mining algorithm for social networks. Romero and Kleinberg [12] investigate the directed
closure process and analyze link formation on Twitter. In [14], Murata and Moriyasu describe an
improved method for predicting links on Question-Answering Bulletin Boards (QABB), a kind of
social network in which each link is assigned a weight.

Most of these works on link prediction aim to find a method with better prediction performance
for particular networks, such as co-authorship networks, terrorist networks and so on. However,
little has been done to reveal how the existing methods perform on networks with different
structural properties. In this paper, we try to find the relation between network structure and the
prediction performance of these methods. In the real world, node attributes are usually difficult to
collect, and prediction methods also need to be simple. For example, an online social network needs
to provide a list of potential friends for a given user while imposing the least possible load on
the server. For this reason, in the present work we focus on the first category of methods, which
are based solely on network structure. Through experiments on both synthetic and real-world
networks, we find that on networks with low clustering these methods perform poorly. However, as
the clustering of the network grows, their precision improves drastically. These observations tell
us that for networks with different clusterings, we should employ different methods for link
prediction.

The rest of the paper is organized as follows. In Section 2 we review several similarity-based
methods for link prediction. In Section 3 we introduce the data sets used in the paper. In Section 4
we investigate the connection between clustering and the performance of prediction methods, and
give a brief explanation. In Section 5 we briefly conclude this work.



2 Preliminaries 

In this section, we first describe the link prediction prob- 
lem and introduce the evaluation metrics. Then we review 
several similarity-based methods. 

Suppose we have an undirected simple network $G(V, E)$, where $V$ is the set of nodes and $E$ is
the set of edges. The number of a node's connections is its degree. The averaged degree of the
network is defined as

$$\langle k \rangle = \frac{2|E|}{|V|},$$

which can be used to characterize the density of the network. We use $p(k)$ to denote the degree
distribution of the network; for the complex networks discussed in this paper, it always follows a
power law. The relative size of the giant connected component (GCC) is denoted as $f_{GCC}$. The
clustering of a node $i$ characterizes how closely its neighbors are connected. It is defined as

$$C_i = \frac{2|E_i|}{k_i(k_i - 1)},$$

where $E_i$ is the set of ties between $i$'s neighbors and $k_i$ is the degree of $i$. We do not
take the case of $k_i = 1$ into consideration. The averaged clustering of the network is defined as

$$C = \frac{1}{|V|} \sum_{i \in V} C_i. \qquad (1)$$

In the rest of the paper, we omit the word "averaged" if there is no confusion in the context.
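The definitions of averaged degree and averaged clustering can be illustrated with a minimal Python sketch. The adjacency-dictionary representation and the function names are assumptions made for illustration, and nodes with fewer than two neighbors are given clustering 0 here, which is only one possible reading of how the case $k_i = 1$ is handled.

```python
from itertools import combinations


def average_degree(adj):
    """<k> = 2|E| / |V| for an undirected simple graph {node: set(neighbors)}."""
    num_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * num_edges / len(adj)


def node_clustering(adj, i):
    """C_i = 2|E_i| / (k_i (k_i - 1)), with E_i the ties among i's neighbors."""
    k = len(adj[i])
    if k < 2:
        return 0.0  # assumption: nodes with k_i < 2 contribute zero clustering
    ties = sum(1 for u, v in combinations(adj[i], 2) if v in adj[u])
    return 2 * ties / (k * (k - 1))


def average_clustering(adj):
    """C = (1/|V|) * sum of C_i over all nodes."""
    return sum(node_clustering(adj, i) for i in adj) / len(adj)


# Hypothetical toy graph: triangle 0-1-2 with a pendant node 3 attached to node 2.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(average_degree(example), average_clustering(example))
```

In the toy graph, nodes 0 and 1 have clustering 1, node 2 has clustering 1/3 and the pendant node contributes 0.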

For any pair of nodes $(x, y)$ that is not in $E$, each similarity-based method assigns a score
$s(x, y)$ according to the given network topology. We then rank all node pairs by score; a higher
score means a higher probability that the corresponding link will emerge in the future or is
missing from the present sample.

To test the prediction accuracy of each method, we adopt the approach used in [21]. The edge set
$E$ is randomly divided into two parts, $E^{train}$ and $E^{test}$. The training set $E^{train}$ is
treated as known information, and the testing set $E^{test}$ consists of missing links or links
that will occur in the future. The training set contains 90% of the links in $E$, and the remaining
10% of the links are in the testing set. We use precision to quantify the accuracy of the
prediction measures, which is determined as follows. Let $n$ denote the number of links in
$E^{test}$. We compute the score list based on $G'(V, E^{train})$ and rank the list in decreasing
order. The first $n$ pairs are taken, and $m$ denotes the size of the intersection of this set of
pairs with $E^{test}$. The precision is then $P = m/n$.
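The evaluation protocol can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: the names `split_edges`, `precision` and `common_neighbors` are hypothetical, the seed is arbitrary, and ties in the ranking are broken by pair order, which the text does not specify.

```python
import random
from itertools import combinations


def split_edges(edges, test_fraction=0.1, seed=0):
    """Randomly split the edge list into (train, test), with 10% in test by default."""
    rng = random.Random(seed)
    shuffled = sorted(edges)
    rng.shuffle(shuffled)
    n_test = max(1, round(test_fraction * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]


def precision(nodes, train_edges, test_edges, score):
    """P = m / n: fraction of the top-n ranked non-observed pairs found in the test set."""
    adj = {v: set() for v in nodes}
    for u, v in train_edges:
        adj[u].add(v)
        adj[v].add(u)
    # Score every pair that is not linked in the training graph.
    candidates = [(u, v) for u, v in combinations(sorted(nodes), 2) if v not in adj[u]]
    ranked = sorted(candidates, key=lambda pair: score(adj, *pair), reverse=True)
    n = len(test_edges)
    test_set = {frozenset(e) for e in test_edges}
    m = sum(1 for pair in ranked[:n] if frozenset(pair) in test_set)
    return m / n


def common_neighbors(adj, x, y):
    """Example score function: number of shared neighbors."""
    return len(adj[x] & adj[y])
```

Any score function with the signature `score(adj, x, y)` can be plugged into `precision`, so the same harness can compare different measures on one split.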

We mainly explore six existing similarity-based measures for link prediction: (1) Common
Neighbors (CN); (2) Adamic-Adar Index (AA); (3) Resource Allocation Index (RA); (4) Katz Index
(Katz); (5) Rooted PageRank (PR); (6) Superposed Random Walk (SRW). A brief introduction to these
methods is given below.

Common Neighbors. For a node $x$ in $G$, $N(x)$ denotes the set of neighbors of $x$. The Common
Neighbors measure is the number of nodes that link to both $x$ and $y$; that is, two nodes are
more likely to be connected if they have more common neighbors. The score is therefore defined as

$$s^{CN}(x, y) = |N(x) \cap N(y)|. \qquad (2)$$



Adamic-Adar Index. In [2], to determine whether two personal home pages are strongly "related",
Adamic and Adar define the similarity between two pages based on their shared features. For link
prediction, this index assigns more weight to less-connected common neighbors, i.e.,

$$s^{AA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{1}{\log k(z)}, \qquad (3)$$

where $k(z)$ is the degree of the node $z$.

Resource Allocation Index. Zhou et al. [21] consider the following process: for a pair of
unconnected nodes $x$ and $y$, node $x$, holding a unit of resource, can send some of it to $y$ by
sending equal amounts to its neighbors. The more resource $y$ receives from $x$, the more likely a
link between $x$ and $y$ exists. Therefore, the score between $x$ and $y$ is defined as

$$s^{RA}(x, y) = \sum_{z \in N(x) \cap N(y)} \frac{1}{k(z)}. \qquad (4)$$
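The three neighborhood-based scores of Eqs. (2) to (4) differ only in the weight given to each common neighbor, as the following minimal Python sketch shows. The adjacency-dictionary input and the function names are assumptions made for illustration; the natural logarithm is used for AA, which does not affect the ranking.

```python
import math


def cn_score(adj, x, y):
    """Common Neighbors: |N(x) ∩ N(y)|."""
    return len(adj[x] & adj[y])


def aa_score(adj, x, y):
    """Adamic-Adar: sum of 1 / log k(z) over common neighbors z.

    A common neighbor always has degree >= 2, so log k(z) > 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])


def ra_score(adj, x, y):
    """Resource Allocation: sum of 1 / k(z) over common neighbors z."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])


# Hypothetical toy graph: nodes 0 and 1 share neighbors 2 (degree 3) and 3 (degree 2).
toy = {0: {2, 3}, 1: {2, 3}, 2: {0, 1, 4}, 3: {0, 1}, 4: {2}}
print(cn_score(toy, 0, 1), aa_score(toy, 0, 1), ra_score(toy, 0, 1))
```

RA penalizes high-degree common neighbors more strongly than AA, because $1/k$ falls off faster than $1/\log k$.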

Katz Index. The Katz Index is a path-ensemble-based method. It sums over all paths between $x$
and $y$; the more short paths there are, the higher the score. It is defined as

$$s^{Katz}(x, y) = \sum_{l=1}^{\infty} \beta^l \, |paths^{\langle l \rangle}_{x,y}|, \qquad (5)$$

where $\beta$ is an adjusting parameter and $paths^{\langle l \rangle}_{x,y}$ is the set of all
paths of length $l$ from $x$ to $y$. As mentioned in [10], we can obtain the score matrix
$S^{Katz}$ by

$$S^{Katz} = \sum_{l=1}^{\infty} \beta^l A^l = (I - \beta A)^{-1} - I, \qquad (6)$$

where $I$ is the identity matrix and $A$ is the adjacency matrix of $G$.
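As a sanity check on Eq. (6), the series can be summed numerically without a matrix inverse. The sketch below is pure Python and purely illustrative: it iterates $S \leftarrow \beta A + \beta A S$, whose fixed point is $(I - \beta A)^{-1} - I$ whenever $\beta$ is smaller than the reciprocal of the largest eigenvalue of $A$. The function name and tolerance are assumptions.

```python
def katz_scores(adj_matrix, beta, tol=1e-12, max_iter=10_000):
    """Sum_{l>=1} beta^l A^l, computed by the fixed-point iteration S <- bA + bA S."""
    n = len(adj_matrix)
    beta_a = [[beta * a for a in row] for row in adj_matrix]
    s = [row[:] for row in beta_a]  # first term of the series: beta * A
    for _ in range(max_iter):
        s_next = [
            [beta_a[i][j] + sum(beta_a[i][k] * s[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)
        ]
        diff = max(abs(s_next[i][j] - s[i][j]) for i in range(n) for j in range(n))
        s = s_next
        if diff < tol:
            return s
    raise ValueError("series did not converge; beta is probably too large")


# Path graph 0 - 1 - 2 as an adjacency matrix.
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
scores = katz_scores(path, beta=0.1)
print(scores[0][2])
```

On the three-node path, the series between the two end nodes is $\beta^2 + 2\beta^4 + 4\beta^6 + \dots = \beta^2 / (1 - 2\beta^2)$, which gives a closed-form value to compare against.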

Rooted PageRank Index. Rooted PageRank defines a random walk on the underlying graph $G$. A
random walk starts from a node $x$ and iteratively moves to a neighbor of the current node chosen
uniformly at random. We use the probability that a random walk starting from $x$ runs into $y$ as
the indicator of similarity between $x$ and $y$ [19]. The score $s^{PR}(x, y)$ under Rooted
PageRank is defined as the stationary probability of $y$ under the following random walk: at each
step the walk returns to $x$ with probability $1 - \beta$ and moves to a random neighbor of the
current node with probability $\beta$. Let $D$ be the diagonal matrix with $D_{ii} = 1/k_i$ and
$D_{ij} = 0$ when $i \neq j$, and let

$$T = DA;$$

then the score matrix $S^{PR}$ is

$$S^{PR} = (1 - \beta)(I - \beta T)^{-1}. \qquad (7)$$
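One row of $S^{PR}$ can be obtained by simulating the restart walk directly, which avoids the matrix inverse in Eq. (7). The following pure-Python sketch is illustrative only: it iterates $p \leftarrow (1 - \beta) e_x + \beta\, p\, T$, whose fixed point is row $x$ of $(1 - \beta)(I - \beta T)^{-1}$. The function name is hypothetical, and the graph is assumed to have no isolated nodes.

```python
def rooted_pagerank_row(adj, root, beta, tol=1e-12, max_iter=10_000):
    """Stationary distribution of a walk that restarts at `root` with probability 1 - beta."""
    p = {v: 0.0 for v in adj}
    p[root] = 1.0
    for _ in range(max_iter):
        nxt = {v: 0.0 for v in adj}
        nxt[root] = 1.0 - beta  # restart mass returned to the root
        for u, mass in p.items():
            if mass == 0.0:
                continue
            share = beta * mass / len(adj[u])  # uniform move to each neighbor
            for v in adj[u]:
                nxt[v] += share
        diff = max(abs(nxt[v] - p[v]) for v in adj)
        p = nxt
        if diff < tol:
            return p
    raise ValueError("iteration did not converge")


triangle = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
row = rooted_pagerank_row(triangle, root=0, beta=0.5)
print(row)
```

For the triangle with $\beta = 0.5$, solving the fixed-point equations by hand gives 0.6 for the root and 0.2 for each of the other two nodes.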

Superposed Random Walk. Liu and Lü [11] propose the Superposed Random Walk Index, which focuses
on few-step random walks rather than the stationary probability. The transition probability matrix
is denoted as $P$, with $P_{xy} = a_{xy}/k_x$, where $a_{xy}$ is the corresponding entry of $A$.
Given a random walk starting at $x$, the probability that it is located at $y$ after $t$ steps is
$\pi_{xy}(t)$. $\vec{\pi}_x(0)$ is an $N \times 1$ vector whose $x$-th element equals 1 and whose
other elements equal 0. Then we have

$$\vec{\pi}_x(t) = P^T \vec{\pi}_x(t-1). \qquad (8)$$

The similarity based on Local Random Walk is defined as

$$s^{LRW}_{xy}(t) = \frac{k_x}{2|E|}\,\pi_{xy}(t) + \frac{k_y}{2|E|}\,\pi_{yx}(t). \qquad (9)$$



Table 1. Synthetic and real-world data sets

Network         |V|     |E|      ⟨k⟩      C        f_GCC
BA(1000,2)      1000    1997     4        0.027    1
BA(1000,5)      1000    4985     10       0.039    1
BA(1000,10)     1000    9945     20       0.064    1
BA(2000,5)      2000    9985     10       0.024    1
BA(4000,5)      4000    19985    10       0.017    1
Netscience      1461    2742     3.75     0.878    0.26
Power Grid      4941    6594     2.67     0.107    1
Politic Blog    1224    16715    27.31    0.36     0.998

The Superposed Random Walk Index superposes the contributions of independently moving walkers;
in our configuration we compute the 3-step SRW rather than the optimal-step SRW. The score for
the pair $(x, y)$ is defined as

$$s^{SRW}(x, y, t) = \sum_{\tau=1}^{t} s^{LRW}_{xy}(\tau). \qquad (10)$$
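A small Python sketch can make the few-step computation concrete. It propagates the walker distributions step by step and accumulates the Local Random Walk similarity, assuming the degree-proportional weights $k_x / 2|E|$ used by Liu and Lü; the function names and the dictionary representation are illustrative assumptions.

```python
def walk_step(adj, dist):
    """One step of the uniform random walk applied to a sparse distribution."""
    nxt = {}
    for u, mass in dist.items():
        share = mass / len(adj[u])
        for v in adj[u]:
            nxt[v] = nxt.get(v, 0.0) + share
    return nxt


def srw_score(adj, x, y, steps=3):
    """Superposed Random Walk: sum of LRW similarities for tau = 1..steps."""
    two_m = sum(len(nbrs) for nbrs in adj.values())  # 2|E|
    q_x, q_y = len(adj[x]) / two_m, len(adj[y]) / two_m
    pi_x, pi_y = {x: 1.0}, {y: 1.0}  # walkers start at x and at y
    total = 0.0
    for _ in range(steps):
        pi_x, pi_y = walk_step(adj, pi_x), walk_step(adj, pi_y)
        # LRW similarity at this step: q_x * pi_xy + q_y * pi_yx
        total += q_x * pi_x.get(y, 0.0) + q_y * pi_y.get(x, 0.0)
    return total


# Path graph 0 - 1 - 2.
path_graph = {0: {1}, 1: {0, 2}, 2: {1}}
print(srw_score(path_graph, 0, 2), srw_score(path_graph, 0, 1))
```

On the path graph, the two end nodes can only meet after an even number of steps, so only the second step contributes to their 3-step score.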

Using precision as the measure and the prediction methods above, we then perform experiments
on both the synthetic and real-world networks introduced in the next section.



3 Data Sets 

Complex networks are pervasive in the real world. Empirical studies suggest that most complex
networks exhibit the "scale-free" property, which means $p(k) \propto k^{-\gamma}$. Barabási and
Albert [3] proposed a scale-free network model, known as the BA model, to explain the mechanism
that generates the "power-law" distribution. We use this model to generate the synthetic networks.
We denote a network generated by the BA model as BA($N$, $m$), where $N$ is the size of the
generated network and $m$ is the number of links that a new node establishes when it is added to
the network, so that the averaged degree is $2m$. We generate five networks in this paper.
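For readers who want to reproduce networks of the form BA($N$, $m$), a minimal preferential-attachment sketch in Python follows. The seed graph is an assumption: the text does not say how the model was initialized, and a complete graph on $m + 1$ nodes is used here because it happens to reproduce the edge counts reported in Table 1. The function name and random seed are hypothetical.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Grow a BA(n, m) graph; returns {node: set(neighbors)}."""
    rng = random.Random(seed)
    adj = {v: set() for v in range(n)}
    stubs = []  # each node appears once per incident edge, so sampling is degree-proportional
    # Assumed seed graph: complete graph on the first m + 1 nodes.
    for u in range(m + 1):
        for v in range(u + 1, m + 1):
            adj[u].add(v)
            adj[v].add(u)
            stubs += [u, v]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:  # m distinct targets, chosen preferentially
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs += [new, t]
    return adj


graph = barabasi_albert(1000, 2)
print(sum(len(nbrs) for nbrs in graph.values()) // 2)
```

With this seed graph the number of edges is $m(N - m - 1) + m(m + 1)/2$, independent of the random choices.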

We also use three typical real-world complex networks collected from different fields.
Netscience is a network of co-authorships between scientists who themselves publish on the topic
of network science [15]. There are 1589 scientists in this network, 128 of whom are isolated. We do
not use these isolated nodes in our experiments. Power Grid is a well-connected electrical power
grid of the western US, where nodes denote generators, transformers and substations, and edges
denote the transmission lines between them [50]. Politic Blog is a directed network of US
political blogs [1]. Here we treat its links as undirected and omit self-connections.

The detailed descriptions of these data sets are listed in Table 1.

Fig. 1. Link prediction on BA(1000,5) with varying clustering



4 How Clustering affects Predicting Precision 

In this section, we first perform experiments on synthetic networks with various clusterings
and unveil the relation between network structure and the precision of link prediction methods.
We then validate our findings on representative real-world data sets. Finally, we give an
explanation of the phenomenon based on class distribution.



4.1 Results from Synthetic Networks 



We investigate the relationship between clustering and the precision of link prediction in this
subsection. To unveil this relationship, we need to vary the clustering of the network. For this
purpose, we use the method proposed by Kim et al. [8] to rewire links randomly, which varies the
clustering without changing the degrees of the nodes. In particular, we first randomly pick two
edges, say $(A, B)$ and $(C, D)$. We then compare the numbers of local triangular structures
associated with the three configurations $\{(A, B), (C, D)\}$, $\{(A, C), (B, D)\}$ and
$\{(A, D), (B, C)\}$, select the one with the most triangles and connect the nodes accordingly,
avoiding duplicated links [13]. Note that in this process, whenever a link is detached from a
node, a different link is immediately attached to that node. We continue this process until the
desired value of $C$ is attained. In this approach, the degree sequence of the network is fixed,
and the only topological property changed is $C$.

We apply the six methods to the generated networks with various clusterings. The results for
BA(1000,5) are shown in Fig. 1. For Katz and PR, we report only the best-performing setting, i.e.
$\beta = 0.0005$ for Katz and $\beta = 0.1$ for PR. It can be seen that as the clustering $C$
increases, all of these similarity-based methods achieve better prediction performance. In
networks with relatively small $C$, e.g. $C < 0.01$, no method is clearly more competitive than
the others, while once $C$ grows beyond a certain value, AA and RA appear to be better. The
results for the other synthetic data sets are similar. For example, Table 2 shows the results for
BA(4000,5). We can see that the correlation between $C$ and the precision of the different
methods does not vary with the size and density of the network. It is also interesting that the
performance of Katz depends greatly on the value of $\beta$. For instance, when $\beta = 0.0005$,
it performs best as clustering grows. This means that the nearest neighbors play a vital role in
the prediction for Katz, whereas considering further hops is unnecessary.

Meanwhile, as shown in Fig. 2, we choose three representative clusterings, i.e., $C = 0.1$,
$C = 0.3$ and $C = 0.5$, to observe how these methods perform on networks with different densities
when the size and clustering of the networks are held constant. Fig. 2 shows that these methods
perform better on denser networks. However, for the sparse network with low clustering, say
$C = 0.1$, as shown in Fig. 2(a), SRW performs best compared with the other approaches when
$m = 2$. Nevertheless, the situation changes when the clustering of the network grows: AA and RA
then perform better. In particular, RA is the best of these methods for the dense network with
high clustering, as shown in Fig. 2(c).

In summary, we find that on the synthetic networks generated by the BA model, the performance of
these prediction methods improves as the clustering grows. A natural question is whether a similar
phenomenon can be found in real-world networks. We therefore validate this finding on real-world
data sets in the next subsection.
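The degree-preserving rewiring procedure used in this subsection to raise the clustering can be sketched as follows. This is a hedged illustration, not the implementation of Kim et al.: the "number of local triangular structures" of a configuration is taken here to be the number of common neighbors of the endpoints of its two edges, ties keep the current configuration, and all function names are hypothetical.

```python
import random


def count_triangles(adj):
    """Total number of triangles in the graph."""
    return sum(len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v) // 3


def triangles_through(adj, pairs):
    """Triangles that the given edges would close (assumed triangle measure)."""
    return sum(len(adj[u] & adj[v]) for u, v in pairs)


def rewire_step(adj, edges, rng):
    """Pick two edges and keep the pairing of their four end nodes with the most triangles."""
    i, j = rng.sample(range(len(edges)), 2)
    (a, b), (c, d) = edges[i], edges[j]
    if len({a, b, c, d}) < 4:
        return False  # edges share a node; skip to avoid self-loops
    for u, v in ((a, b), (c, d)):  # detach both edges
        adj[u].discard(v)
        adj[v].discard(u)
    options = [((a, b), (c, d))]
    for cand in (((a, c), (b, d)), ((a, d), (b, c))):
        if all(v not in adj[u] for u, v in cand):  # avoid duplicated links
            options.append(cand)
    best = max(options, key=lambda cfg: triangles_through(adj, cfg))
    for u, v in best:  # every node regains exactly one link, so degrees are unchanged
        adj[u].add(v)
        adj[v].add(u)
    edges[i], edges[j] = best
    return best != options[0]
```

Calling `rewire_step` repeatedly and stopping once the averaged clustering reaches a target value gives a family of networks with the same degree sequence and increasing $C$.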



4.2 Validation on Real-world Data Sets 

The results of the link prediction experiments on these real-world networks are shown in
Table 3 and are consistent with the simulation experiments above. The prediction methods perform
best on Netscience, which has the largest clustering, and worst on Power Grid, which has the
smallest $C$.

Based on the validation above, we conjecture that in real-world networks the performance of
these link prediction methods is closely related to the clustering of the network. That is, on
networks with higher clustering these methods perform better, and when the clustering decreases
their precision drops.

4.3 An Explanation Based on Class Distribution 

In this subsection, we try to explain the finding from the viewpoint of class distribution. Here
we treat a pair of connected nodes as a positive instance and a pair of disconnected nodes as a
negative instance. As mentioned in [17], the highly skewed distribution of positive and negative
examples incurs the computational cost of considering all node pairs and increases the variance of
the prediction model. We assume that the scores of each particular link prediction


Xu Feng, Jichang Zhao, Ke Xu: Link Prediction in Complex Networks: A Clustering Perspective 5 



Table 2. The result from BA(4000,5)

                                                  Katz                          PR
C         CN        AA        RA        SRW       β=0.05    β=0.005   β=0.0005  β=0.1     β=0.5     β=0.9
0.0176    0.0134    0.0128    0.0096    0.0145    0.0003    0.0134    0.0150    9.91E-05  5.04E-05  0.0012
0.1       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
0.15      0.0557    0.0901    0.0846    0.0704    0.0321    0.0404    0.0383    0.0091    0.0095    0.0105
0.2       0.0722    0.1189    0.1159    0.0932    0.0487    0.0534    0.0544    0.0205    0.0183    0.0185
0.25      0.0925    0.1461    0.1430    0.1132    0.0592    0.0638    0.0657    0.0358    0.0334    0.0326
0.3       0.1168    0.1781    0.1726    0.1342    0.0616    0.0832    0.0800    0.0522    0.0532    0.0501
0.35      0.1432    0.2130    0.2037    0.1610    0.0937    0.1054    0.1067    0.0744    0.0702    0.0690
0.4       0.1785    0.2514    0.2405    0.1827    0.1060    0.1383    0.1410    0.0979    0.0943    0.0898
0.45      0.2162    0.2898    0.2774    0.2140    0.1203    0.1790    0.1841    0.1184    0.1184    0.1104
0.5       0.2573    0.3279    0.3163    0.2458    0.1320    0.2391    0.2346    0.1541    0.1519    0.1364
0.55      0.3074    0.3704    0.3616    0.2824    0.1152    0.2983    0.3025    0.1882    0.1853    0.1702
0.6       0.3450    0.4132    0.4081    0.3154    0.1241    0.3226    0.3292    0.2267    0.2143    0.1903



Fig. 2. P varies with density of the network. Panels: (a) C = 0.1, (b) C = 0.3, (c) C = 0.5; each panel compares CN, AA, RA, SRW, Katz and PR.



Table 3. The result from real-world data sets

                                                      Katz                          PR
Network       CN        AA        RA        SRW       β=0.05    β=0.005   β=0.0005  β=0.1     β=0.5     β=0.9
Netscience    0.4494    0.6666    0.6805    0.5760    0.3796    0.4569    0.4423    0.3734    0.3817    0.3013
Power Grid    0.0438    0.0281    0.0252    0.0227    0.0085    0.0067    0.0110    0.0085    0.0067    0.0110
Politic Blog  0.1724    0.1712    0.1497    0.1421    0.0309    0.1776    0.1733    0.0141    0.0288    0.0537



method are drawn from separate distributions for linked and non-linked node pairs. In
principle, a similarity-based method for link prediction tries to distinguish the two
distributions of positive and negative examples by their scores. Next, we focus on how the
distributions of scores on the two types of node pairs vary as the network structure changes.



As shown in Fig. 3(a), Fig. 3(b) and Fig. 3(c) for CN, a clear separation emerges between the
distributions of $s^{CN}$ on positive and negative pairs as $C$ of the network increases from
0.039 to 0.5. This trend can also be observed with RA, as shown in Fig. 3(d), Fig. 3(e) and
Fig. 3(f). As the clustering of the network increases, node pairs with higher scores are more
likely to be positive instances. Recall that in the link prediction process, a higher score means
a higher probability that a link will emerge or is missing, so we can conclude that these
prediction methods are more effective in networks with greater clustering. In summary, increased
clustering improves the capability of these methods to distinguish positive from negative node
pairs, which leads to higher prediction precision, as the experiments show.
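The comparison of score distributions described in this subsection can be sketched in Python as follows. The sketch assumes that positive instances are the held-out test links and negative instances are all other node pairs not linked in the training graph; the function names are hypothetical, and Common Neighbors stands in for any of the scores.

```python
from collections import Counter
from itertools import combinations


def common_neighbors(adj, x, y):
    """Example score: number of shared neighbors in the training graph."""
    return len(adj[x] & adj[y])


def score_distributions(adj, positive_pairs, score):
    """Return normalized score histograms for positive and negative node pairs."""
    positives = {frozenset(p) for p in positive_pairs}
    pos, neg = Counter(), Counter()
    for x, y in combinations(sorted(adj), 2):
        if y in adj[x]:
            continue  # already linked in the training graph, not a candidate
        bucket = pos if frozenset((x, y)) in positives else neg
        bucket[score(adj, x, y)] += 1

    def normalize(counts):
        total = sum(counts.values())
        return {s: c / total for s, c in sorted(counts.items())}

    return normalize(pos), normalize(neg)


# Hypothetical training graph; the pair (0, 2) is the held-out positive instance.
train_adj = {0: {1, 3}, 1: {0, 2, 3}, 2: {1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
pos_dist, neg_dist = score_distributions(train_adj, [(0, 2)], common_neighbors)
print(pos_dist, neg_dist)
```

The less the two histograms overlap, the more test links end up at the top of the ranking, which is what the precision measure rewards.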



5 Conclusion 

Link prediction is an open problem in complex networks that has attracted wide attention in
recent years. Plenty of methods have been presented, some of which are based solely on network
structure while others take further features of the network into account. However, in the real
world, practical methods need to be simple and free of the need for rich attributes. For this
reason, we mainly investigate the relationship between six structural approaches and the
clustering of networks. Interestingly, we find that the performance of these methods improves
tremendously as the clustering increases, on both synthetic and real-world networks. We also give
a brief explanation of this through the degree of separation between the distributions of positive
and negative instances caused by the variation of clustering. Our finding also sheds light on how
to choose a simple but effective method for real networks with various clusterings. We conjecture
that for sparse networks with lower clustering SRW is the best choice, while for dense and highly
clustered networks the best choice is RA.